1 The Dataset

1.1 Characterization

The dataset chosen for the analysis is the following:

  • Title: Characterizing gene expression in lung tissue of COPD subjects using RNA-seq
  • Overall design: Examination of lung tissue in COPD patients versus normal control
  • SRA: SRP041538
  • BioProject: PRJNA245811
  • GEO: GSE57148

They analyzed gene expression profiling of lung tissue to define molecular pathway of COPD using recent RNA sequencing technology. Lung tissue was obtained from 98 COPD subjects and 91 subjects with normal spirometry. RNA isolated from these samples was processed with RNA-seq using HiSeq 2000. Gene expression measurements were calculated using Cufflinks software.

1.2 The Disease

Chronic obstructive pulmonary disease (COPD) is the third leading cause of death worldwide, causing 3.23 million deaths in 2019 (“Chronic Obstructive Pulmonary Disease (COPD) - World Health Organization 2020). It is a common, preventable and treatable chronic lung disease which affects approximately 300 million men and women worldwide, mainly in low to middle-income countries (Ruvuna and Sood 2020).

As it develops, abnormalities in the small airways of the lungs lead to limitation of airflow in and out of the lungs. Several processes cause the airways to become narrow. There may be destruction of parts of the lung, mucus blocking the airways, and inflammation and swelling of the airway lining. It usually manifests itself through cough, wheezing and difficulty breathing.

1.3 Initial data handling

The files were downloaded from recount2 using the study acession id “SRP041538”. The data was then scaled and the patients were divided into conditions reflecting whether or not they had COPD, with the reference being the individuals with normal spirometry (NOR). The samples are formatted according to this specific design and then filtered by average gene counts following a threshold adequate for the analysis pipeline used (Sha, Phan, and Wang 2015).

# Importing libraries
library(DiagrammeR)
library(recount)
library(DESeq2)
library(ggplot2)
library(tidyverse)
library(ggrepel)
library(DOSE)
library(DT)
require(biomaRt)
library(pheatmap)
library(fgsea)
library(reactome.db)

# Defining study acession
study_acession <- "SRP041538"

# Downloading data, if not already available
if (!file.exists(file.path(study_acession, "rse_gene.Rdata"))) {
  download_study(study_acession)
}

load(file.path(study_acession, "rse_gene.Rdata"))

# Reading and labeling data
rse <- scale_counts(rse_gene)
rse$condition <- sapply(strsplit(as.character(rse$title), "-"), `[`, 2)

# Formatting dataset according to the specific design
dds <- DESeqDataSet(rse, ~condition)
dds$condition <- relevel(dds$condition, ref = "NOR")

# Filtering samples
keepers <- (rowSums(counts(dds)) / ncol(dds)) >= quantile((rowSums(counts(dds)) / ncol(dds)), .2)
dds <- dds[keepers, ]

2 RNA-Seq Analysis

2.1 Differential expression analysis

We’ll perform the differential expression analysis using DESeq2. The referencial group is that of the individuals with normal spirometry (NOR) and the alpha error cuttoff shall be 0.05. Gene symbols are the way they’ll be referred to in the next plots and are HGNC symbols whenever possible (in the absence of an equivalent HGNC symbol, the standard EnsemblID shall be used). The genes exhibited in this following table are those statiscally significant and with an absolute fold change > 2:

dds <- DESeq(dds)

res <- as_tibble(results(dds, alpha = 0.05), rownames = "gene") %>%
  rename(logFC = log2FoldChange) %>%
  rename(FDR = padj)

ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "www.ensembl.org")

genes <- gsub("\\..*", "", res$gene)

genemap <- getBM(
  attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"),
  filters = "ensembl_gene_id",
  values = genes,
  mart = ensembl
)

check_local_symbol <- function(ref, local_symbols) {
  if (ref %in% local_symbols$group_name) {
    hgnc_symbol <- local_symbols$value[local_symbols$group_name == ref][1]
    if (!hgnc_symbol == "") {
      return(hgnc_symbol)
    } else {
      return(NA)
    }
  } else {
    return(NA)
  }
}

local_symbols <- data.frame(rowData(dds)$symbol)
local_symbols$value[is.na(local_symbols$value)] <- ""
gene_id_list <- NULL

gene_id_lister <- function(ref, genemap) {
  if (ref %in% genemap$ensembl_gene_id) {
    hgnc_symbol <- genemap$hgnc_symbol[genemap$ensembl_gene_id == ref][1]
    if (!hgnc_symbol == "") {
      return(hgnc_symbol)
    } else {
      ref <- check_local_symbol(ref, local_symbols)
      return(ref)
    }
  } else {
    ref <- check_local_symbol(ref, local_symbols)
    return(ref)
  }
}

gene_id_list <- sapply(genes, gene_id_lister, genemap)

res <- res %>%
  add_column(symbol = gene_id_list) %>%
  relocate(symbol, .after = gene)

deseq_genes <- res %>%
  as.data.frame() %>%
  filter(FDR < 0.05) %>%
  dplyr::select(EnsemblID = gene, symbol, logFC, pvalue, FDR) %>%
  arrange(FDR)
deseq_genes_count <- nrow(deseq_genes)
output_note <- paste0("\nTotal differential expressed genes with adjusted p<0.05: ", deseq_genes_count, "\n(top 1000 shown above)")
deseq_genes %>%
  datatable()

3 Data Visualization

3.1 PCA plot

tdds <- vst(dds, blind = T)

pcaData <- plotPCA(tdds, intgroup = "condition", returnData = TRUE)

percentVar <- round(100 * attr(pcaData, "percentVar"))

ggplot(pcaData, aes(PC1, PC2, color = group)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  coord_fixed() +
  ggtitle("PCA plot of COPD patients x Healthy individuals")

3.2 Volcano plot

res <- res %>%
  mutate(
    Expression = case_when(
      logFC >= log2(2) & FDR <= 0.05 ~ "Up-regulated",
      logFC <= -log2(2) & FDR <= 0.05 ~ "Down-regulated",
      TRUE ~ "Unchanged"
    )
  ) %>%
  mutate(
    Significance = case_when(
      abs(logFC) >= log2(2) & FDR <= 0.05 & FDR > 0.01 ~ "FDR 0.05",
      abs(logFC) >= log2(2) & FDR <= 0.01 & FDR > 0.001 ~ "FDR 0.01",
      abs(logFC) >= log2(2) & FDR <= 0.001 ~ "FDR 0.001",
      TRUE ~ "Unchanged"
    )
  )

top_bottom <- function(top, res) {
  top_genes <- bind_rows(
    res %>%
      filter(Expression == "Up-regulated") %>%
      arrange(FDR, desc(abs(logFC))) %>%
      head(top),
    res %>%
      filter(Expression == "Down-regulated") %>%
      arrange(FDR, desc(abs(logFC))) %>%
      head(top)
  )
  return(top_genes)
}

volcano_plot <- ggplot(res, aes(logFC, -log(FDR, 10))) +
  geom_point(aes(color = Significance), size = 2 / 5) +
  xlab(expression("log"[2] * "(Fold change)")) +
  ylab(expression("-log"[10] * "FDR")) +
  scale_color_viridis_d(option = "B") +
  guides(colour = guide_legend(override.aes = list(size = 1.5))) +
  geom_vline(xintercept = c(-1, 1), col = "red") +
  geom_hline(yintercept = -log10(0.05), col = "red") +
  ggtitle("Lung tissue transcriptome of COPD patients x Healthy individuals") +
  geom_label_repel(
    data = top_bottom(40, res),
    mapping = aes(logFC, -log(FDR, 10), label = symbol),
    size = 2,
    max.overlaps = Inf
  )

volcano_plot

3.3 Heatmap

tddsa <- assay(tdds)
rownames(tddsa) <- gsub("\\..*", "", row.names(tddsa))

filtered_res <- res %>%
  filter(!is.na(symbol))

heat_genes <- top_bottom(10, filtered_res)
heat_genes$gene <- gsub("\\..*", "", heat_genes$gene)

topDE <- tddsa[heat_genes$gene, ]
rownames(topDE) <- heat_genes$symbol

col_annotation <- as.data.frame(colData(tdds)[, "condition"])
colnames(col_annotation) <- "Condition"
rownames(col_annotation) <- colnames(topDE)

row_annotation <- as.data.frame(heat_genes$Expression)
colnames(row_annotation) <- "Expression"
rownames(row_annotation) <- heat_genes$symbol

pheatmap(topDE,
  scale = "row",
  clustering_distance_rows = "correlation",
  annotation_col = col_annotation,
  annotation_row = row_annotation,
  show_colnames = FALSE,
  annotation_names_row = FALSE,
  annotation_names_col = FALSE,
  main = "20 genes with the strongest differential expression"
)

3.4 GSEA

genemap$entrezgene_id[is.null(genemap$entrezgene_id)] <- NA

entrez_id_lister <- function(ref, genemap) {
  if (ref %in% genemap$ensembl_gene_id) {
    entrez_id <- genemap$entrezgene_id[genemap$ensembl_gene_id == ref][1]
    if (!is.na(entrez_id)) {
      return(entrez_id)
    } else {
      ref <- check_local_symbol(ref, local_symbols)
      return(NA)
    }
  } else {
    ref <- check_local_symbol(ref, local_symbols)
    return(NA)
  }
}

entrez_id_list <- sapply(genes, entrez_id_lister, genemap)

res$entrez_id <- as.character(entrez_id_list)

gsea_data <- subset(res, !is.na(stat)) %>%
  subset(!is.na(FDR))

pathways <- reactomePathways(gsea_data$entrez_id)

ranked_genes <- gsea_data$stat
names(ranked_genes) <- gsea_data$entrez_id

fgseaRes <- fgsea(pathways, ranked_genes, maxSize = 500)

topPathwaysUp <- fgseaRes[NES > 0][head(order(padj), n = 10), pathway]
topPathwaysDown <- fgseaRes[NES < 0][head(order(padj), n = 10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(pathways[topPathways], ranked_genes, fgseaRes,
  gseaParam = 0.5
)

4 Biological relevance

COPD has two main components in its pathophysiological presentation: emphysema and chronic bronchitis. They are shown in the following flowchart:

DiagrammeR::grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']
      tab6 [label = '@@6']
      tab7 [label = '@@7']
      tab8 [label = '@@8']
      tab9 [label = '@@9']
      tab10 [label = '@@10']
      tab11 [label = '@@11']
      tab12 [label = '@@12', style= 'filled', fillcolor = '#E9967A']
      tab13 [label = '@@13', style= 'filled', fillcolor = '#E9967A']
      tab14 [label = '@@14']
      tab15 [label = '@@15']

      # edge definitions with the node IDs
      tab15 -> tab1 -> {tab2 tab4 tab6};
      tab2 -> {tab3 tab4};
      tab3 -> {tab8 tab9 tab10}
      {tab8 tab9 tab10} -> tab12;
      tab4 -> {tab5 tab7} -> tab11;
      tab6 -> tab11;
      tab4 -> tab3;
      tab11-> tab13;
      tab14 -> tab2;
      }

      [1]: 'Inflammation'
      [2]: 'Proteinase–antiproteinase imbalance'
      [3]: 'Destruction of alveolar walls and capillaries'
      [4]: 'Oxidative stress and inflamatory mediators'
      [5]: 'Fibrosis and thickening of bronchiolar walls'
      [6]: 'Mucus hypersecretion and ciliary dysfunction'
      [7]: 'Edema and smooth muscle contration in the airways'
      [8]: 'Enlarged air spaces'
      [9]: 'Impaired gas diffusion'
      [10]: 'Air trapping on expiration'
      [11]: 'Narrowing of small vessels'
      [12]: 'Emphysema'
      [13]: 'Chronic bronchitis'
      [14]: 'Serpin genetic deficiencies'
      [15]: 'Environmental expositions'
      ")

I’ll present the specific findings of the analysis in terms of how they relate to the course of development of the disease.

4.1 Emphysema

Environmental exposures such as smoking, secondhand smoke exposure, air pollution and occupational exposure to other dusts, fumes and chemicals are the leading risk factors for the development of COPD (Ruvuna and Sood 2020). They irritate the lungs, leading to an inflammation driven mainly by neutrophils and lymphocytes.

When recruited to the bronchioli and alveoli, neutrophils release elastases, enzymes that degrade the elastin in the alveolar walls and capillaries of the lungs. Neutrophils also produce reactive oxygen species (ROS), inflammatory mediators and cytokines that promote metalloproteinase (MMPs) activity which is associated with the destruction of the extracellular matrix (ECM) components and the aberrant remodelling of damaged alveoli while inhibiting tissue inhibitors of metalloproteinases (TIMPs) activity which is associated to the maintenance of ECM integrity. This constitutes the classic protease-antiprotease imbalance that leads to emphysema by the destruction of alveolar walls and capillaries with an enlargement of air spaces, impaired gas diffusion and air trapping on expiration (Laurell and Eriksson 2013).

The neutrophil pathologic role in COPD highly depends on the activity of the CXCL-8-CXCR1/2 axis (Ha, Debnath, and Neamati 2017), which was found to be upregulated in patients with the disease. Other genes in the same axis, such as CXCL12, IL1B, RELA and EGFR are also upregulated. As for the metalloproteinases, MMP1, MMP2, MMP10, MMP19 and MMP25 were upregulated. No TIMP was found to be downregulated. On the contrary, TIMP4 was upregulated. In the GEAS plot, we can see the comparatively low enrichment in COPD patients of genes related to the maintenance of the extracellular matrix organisation. This represents the smaller expression of antiproteinases and other constitutive elements that are key to the physiological maintenance of the ECM.

4.2 Chronic bronchitis

The second of the main pillars for the presentation of COPD is chronic bronchitis. The whole inflammatory environment at the lungs leads to mucus hypersecretion, ciliary dysfunction, fibrosis, thickening of the bronchiolar walls, oedema and smooth muscle contraction in the airways, producing a narrowing of the small airways that characterizes COPD’s chronic bronchitis.

We can observe the increase in mucin expression in COPD patients. From the nine main human mucin genes, we find that MUC4, MUC5B and MUC5AC are upregulated. MUC5AC is the main gene previously reported as overexpressed in COPD patients’ sputum (Rogers 2008).

4.3 Mitochondrial dysfunction

A less well-known dimension of the pathophysiology of COPD is its associated mitochondrial dysfunction. When faced with oxidative stress, the lung mitochondria present a decrease in membrane potential and presented abnormal expression regulation (Wiegman et al. 2015).

The GEAS analysis also demonstrates that many mitochondria-related pathways are altered in COPD patients. Prohibitin (PHB-1) has an important role in mtDNA regulation by stabilizing TFAM protein, a known protein component of the mitochondrial nucleoids (Kasashima et al. 2008). In this analysis, PHB-1 was found to be down-regulated and TFAM up-regulated in COPD patients, a combination that can explain at least partially the mitochondrial dysfunction. This increases the inflammatory response even more.

5 Bibliography

“Chronic Obstructive Pulmonary Disease (COPD) - World Health Organization.” 2020. https://www.who.int/news-room/fact-sheets/detail/chronic-obstructive-pulmonary-disease-(copd).
Ha, Helen, Bikash Debnath, and Nouri Neamati. 2017. “Role of the Cxcl8-Cxcr1/2 Axis in Cancer and Inflammatory Diseases.” Theranostics 7 (6): 1543–88. https://doi.org/10.7150/thno.15625.
Kasashima, Katsumi, Megumi Sumitani, Masaaki Satoh, and Hitoshi Endo. 2008. “Human Prohibitin 1 Maintains the Organization and Stability of the Mitochondrial Nucleoids.” Experimental Cell Research 314 (5): 988–96. https://doi.org/10.1016/j.yexcr.2008.01.005.
Laurell, Carl-Bertil, and Sten Eriksson. 2013. “The Electrophoretic α \(_{\textrm{1}}\) -Globulin Pattern of Serum in α \(_{\textrm{1}}\) -Antitrypsin Deficiency.” COPD: Journal of Chronic Obstructive Pulmonary Disease 10 (sup1): 3–8. https://doi.org/10.3109/15412555.2013.771956.
Rogers, Duncan F. 2008. “Mucus Hypersecretion in Chronic Obstructive Pulmonary Disease.” In Novartis Foundation Symposia, edited by Derek Chadwick and Jamie A. Goode, 65–83. Chichester, UK: John Wiley & Sons, Ltd. https://doi.org/10.1002/0470868678.ch5.
Ruvuna, Lisa, and Akshay Sood. 2020. “Epidemiology of Chronic Obstructive Pulmonary Disease.” Clinics in Chest Medicine 41 (3): 314–27. https://doi.org/10.1016/j.ccm.2020.05.002.
Sha, Ying, John H. Phan, and May D. Wang. 2015. “Effect of Low-Expression Gene Filtering on Detection of Differentially Expressed Genes in RNA-Seq Data.” Conference Proceedings : ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference 2015: 6461–64. https://doi.org/10.1109/EMBC.2015.7319872.
Wiegman, Coen H., Charalambos Michaeloudes, Gulammehdi Haji, Priyanka Narang, Colin J. Clarke, Kirsty E. Russell, Wuping Bao, et al. 2015. “Oxidative Stress–Induced Mitochondrial Dysfunction Drives Inflammation and Airway Smooth Muscle Remodeling in Patients with Chronic Obstructive Pulmonary Disease.” The Journal of Allergy and Clinical Immunology 136 (3): 769–80. https://doi.org/10.1016/j.jaci.2015.01.046.